Software Projects - Exploratory Data Analysis


5-Factor Asset Pricing Components

Factors can be seen as proxies for characteristics of equities and other assets that explain performance and earn premiums as compensation for bearing a particular risk. In the Fama-French 5-Factor Model for asset pricing, these risk factors cover market beta, market capitalization, book-to-market equity, operating profitability, and change in investment assets (recent performance momentum was excluded from this analysis). The data is accessed from the online library provided by Kenneth French, which publishes returns relevant to the research into asset pricing models by Eugene Fama and Kenneth French. Because the online library is segmented by type, the data was collected iteratively for all relevant types and stored as variables in a PKL file or as sheets in an XLSX file. An exploratory analysis was performed to show the distributions, time-varying characteristics, and correlations of the data. The project is written in Python, primarily using NumPy, pandas, Matplotlib, Seaborn, urllib, and pickle.

https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
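
As a minimal sketch of the retrieval step, assuming each file in the library is a ZIP archive containing a single CSV, the download and decompression could look like the following. The function names `download_archive` and `read_first_member` and the `latin-1` encoding are assumptions for illustration and are not taken from the project.

```python
import io
import urllib.request
import zipfile


def download_archive(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the raw bytes of a ZIP archive from the given URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def read_first_member(archive: bytes, encoding: str = "latin-1") -> list[str]:
    """Return the text lines of the first file stored in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as bundle:
        name = bundle.namelist()[0]
        return bundle.read(name).decode(encoding).splitlines()


# Offline demonstration with an in-memory archive (hypothetical values),
# so that the parsing step can be checked without network access.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as bundle:
    bundle.writestr("demo.csv", "Date,Mkt-RF\n199007,1.23\n")
demo_lines = read_first_member(buffer.getvalue())
```

In practice, `read_first_member(download_archive(url))` would be called once per source, with the returned lines passed on for parsing.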

Data Considerations

In the construction of the data, the returns are in USD and include dividends and capital gains for the period, without fees or taxes and without continuous compounding (unless specified as annualized). The return from market beta is the difference in return between a value-weighted market portfolio and the 1-month U.S. Treasury bill as the risk-free rate. For the value factor (High Minus Low), profitability factor (Robust Minus Weak), and investment factor (Conservative Minus Aggressive), the equities are sorted into 2 groups by market capitalization (for the international regions, big equities are those in the top 90% of market capitalization for the region and small equities are those in the bottom 10%, while for the United States the breakpoint is the median NYSE market equity) and into 3 groups by book-to-market equity, operating profitability, or change in investment assets respectively (with breakpoints at the 30th and 70th percentiles of the relevant characteristic). The return from the size factor (Small Minus Big) is the equally-weighted average of the three size spreads obtained from the sorts used for the value factor, profitability factor, and investment factor.

Definition used in the data for market beta relative to the risk-free rate:

\begin{gather*} \text{Mkt-Rf} = \text{Value-Weighted Market Portfolio} - \text{Risk-Free Rate} \end{gather*}

Definition used in the data for the size factor and categorized as Small Minus Big:

\begin{gather*} \begin{split} \text{SMB} &= \frac{1}{3} \left(\frac{1}{3} (\text{Small Value} + \text{Small Neutral} + \text{Small Growth}) - \frac{1}{3} (\text{Big Value} + \text{Big Neutral} + \text{Big Growth})\right. + \cdots \\ & \cdots + \frac{1}{3} (\text{Small Robust} + \text{Small Neutral} + \text{Small Weak}) - \frac{1}{3} (\text{Big Robust} + \text{Big Neutral} + \text{Big Weak}) + \cdots \\ & \cdots + \left.\frac{1}{3} (\text{Small Conserv.} + \text{Small Neutral} + \text{Small Aggr.}) - \frac{1}{3} (\text{Big Conserv.} + \text{Big Neutral} + \text{Big Aggr.})\right) \end{split} \end{gather*}

Definition used in the data for the value factor and categorized as High Minus Low:

\begin{gather*} \text{HML} = \frac{1}{2} (\text{Small Value} + \text{Big Value}) - \frac{1}{2} (\text{Small Growth} + \text{Big Growth}) \end{gather*}

Definition used in the data for the profitability factor and categorized as Robust Minus Weak:

\begin{gather*} \text{RMW} = \frac{1}{2} (\text{Small Robust} + \text{Big Robust}) - \frac{1}{2} (\text{Small Weak} + \text{Big Weak}) \end{gather*}

Definition used in the data for the investment factor and categorized as Conservative Minus Aggressive:

\begin{gather*} \text{CMA} = \frac{1}{2} (\text{Small Conservative} + \text{Big Conservative}) - \frac{1}{2} (\text{Small Aggressive} + \text{Big Aggressive}) \end{gather*}
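
The definitions above can be checked numerically. The following is a minimal sketch that applies them to hypothetical one-period returns for the 2x3 sorted portfolios; the numbers, the dictionary layout, and the helper names `long_short` and `size_spread` are illustrative assumptions and do not come from the data library.

```python
from statistics import mean

# Hypothetical one-period returns in percent for each 2x3 sort, keyed by
# (size group, characteristic group). The numbers are illustrative only.
VALUE_SORT = {
    ("Small", "Value"): 1.2, ("Small", "Neutral"): 0.9, ("Small", "Growth"): 0.6,
    ("Big", "Value"): 1.0, ("Big", "Neutral"): 0.7, ("Big", "Growth"): 0.4,
}
PROFITABILITY_SORT = {
    ("Small", "Robust"): 1.1, ("Small", "Neutral"): 0.8, ("Small", "Weak"): 0.5,
    ("Big", "Robust"): 0.9, ("Big", "Neutral"): 0.7, ("Big", "Weak"): 0.5,
}
INVESTMENT_SORT = {
    ("Small", "Conservative"): 1.0, ("Small", "Neutral"): 0.8, ("Small", "Aggressive"): 0.6,
    ("Big", "Conservative"): 0.8, ("Big", "Neutral"): 0.6, ("Big", "Aggressive"): 0.4,
}


def long_short(sort, long_group, short_group):
    """Average the Small and Big legs of one group, minus the same for the other."""
    long_leg = mean(sort[size, long_group] for size in ("Small", "Big"))
    short_leg = mean(sort[size, short_group] for size in ("Small", "Big"))
    return long_leg - short_leg


def size_spread(sort):
    """Average of the three Small portfolios minus the three Big portfolios."""
    small = mean(ret for (size, _), ret in sort.items() if size == "Small")
    big = mean(ret for (size, _), ret in sort.items() if size == "Big")
    return small - big


hml = long_short(VALUE_SORT, "Value", "Growth")
rmw = long_short(PROFITABILITY_SORT, "Robust", "Weak")
cma = long_short(INVESTMENT_SORT, "Conservative", "Aggressive")
# SMB averages the size spread over the three sorts.
smb = mean(size_spread(s) for s in (VALUE_SORT, PROFITABILITY_SORT, INVESTMENT_SORT))
```

Note that the Neutral portfolios differ between the three sorts, which is why each sort is kept in its own dictionary.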

With regard to the regions, countries are grouped by their classification as developed markets or emerging markets (which generally follows the classifications from MSCI) and by relative location, with the extent of the data varying based on availability. The developed markets include Australia, Austria, Belgium, Canada, Switzerland, Germany, Denmark, Spain, Finland, France, Great Britain, Greece, Hong Kong, Ireland, Italy, Japan, Netherlands, Norway, New Zealand, Portugal, Sweden, Singapore, and the United States. The European region includes Austria, Belgium, Switzerland, Germany, Denmark, Spain, Finland, France, Great Britain, Greece, Ireland, Italy, Netherlands, Norway, Portugal, and Sweden. The Asia Pacific regions (Japan and Asia Pacific excluding Japan) include Australia, Hong Kong, Japan, New Zealand, and Singapore. The emerging markets include Brazil, Chile, China, Colombia, Czech Republic, Egypt, Greece, Hungary, India, Indonesia, Malaysia, Mexico, Pakistan, Peru, Philippines, Poland, Qatar, Saudi Arabia, South Africa, South Korea, Taiwan, Thailand, Turkey, and United Arab Emirates.

Overall Global Performance

For the broadest evaluation, the yearly returns can be considered for developed markets and emerging markets as a whole. From this, market beta, the size factor, the value factor, the profitability factor, and the investment factor realized positive and significant premiums on average in both developed and emerging markets. As explored, there is a risk-based reason (higher risk reflected in higher discount rates) to expect a premium to be realized. Without such a reason, any result would more likely be due to data mining, random chance, or inconsistent effects, and there would be no genuine expectation for the premium to persist in the future. Interestingly, the size factor appears weaker and less reliable than the other factors and may not provide a robust premium across time (which is somewhat understandable, as the empirical case for the existence of the size factor is less established).
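
One common way to judge whether an average premium is statistically distinguishable from zero is a one-sample t-statistic. The sketch below is an assumption about how such a check could be done rather than a description of the method used in the project; the yearly returns are hypothetical and the function name `premium_t_statistic` is illustrative.

```python
import math
from statistics import mean, stdev


def premium_t_statistic(returns):
    """t-statistic for the null hypothesis that the mean premium is zero."""
    standard_error = stdev(returns) / math.sqrt(len(returns))
    return mean(returns) / standard_error


# Hypothetical yearly factor returns in percent (illustrative only).
yearly_returns = [8.0, -3.0, 12.0, 5.0, -1.0, 9.0, 4.0, 6.0]
average_premium = mean(yearly_returns)
t_statistic = premium_t_statistic(yearly_returns)
```

A t-statistic above roughly 2 in absolute value is the conventional threshold for significance at about the 5% level, although short samples of yearly data leave considerable uncertainty.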

Yearly realized returns for the components of the Fama-French 5-Factor Model in developed markets:

Yearly realized returns for the components of the Fama-French 5-Factor Model in emerging markets:

Individual Regions Performance

Individual regions, including the United States, Europe, Japan, and Asia Pacific excluding Japan, can be considered to evaluate the pervasiveness of the factors. From this, market beta, the value factor, the profitability factor, and the investment factor realize positive and significant premiums on average and consistently across each of the individual regions. In contrast, the size factor in isolation often does not provide a robust premium across time.

Yearly realized returns for the components of the Fama-French 5-Factor Model in the United States:

Yearly realized returns for the components of the Fama-French 5-Factor Model in Europe:

Yearly realized returns for the components of the Fama-French 5-Factor Model in Japan:

Yearly realized returns for the components of the Fama-French 5-Factor Model in Asia Pacific excluding Japan:

Independence And Intersection

As a measure of the association between the factors, the correlations of the yearly realized returns can be considered. These considerations can be related to Modern Portfolio Theory from Harry Markowitz, where it is asserted that diversification and reduced risk of losses can be achieved by minimizing the correlation of assets within a portfolio. In other words, for assets held with positive weights in a portfolio, a perfect positive correlation leaves the standard deviation of the portfolio equal to the weighted average of the individual standard deviations (no diversification benefit), while any correlation below one reduces it below that weighted average. When constructing a portfolio, this allows for the optimization of the expected return against a certain level of risk (assuming standard deviation is a suitable proxy for risk). However, it should also be kept in mind that these correlations are not necessarily fixed and may change or cluster across different periods.
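
The effect of correlation on portfolio risk can be checked with the standard two-asset formula, in which the portfolio variance is w1^2 s1^2 + w2^2 s2^2 + 2 w1 w2 rho s1 s2. The following is a minimal sketch with hypothetical weights and standard deviations; the function name `portfolio_std` is illustrative.

```python
import math


def portfolio_std(weight_1, std_1, std_2, correlation):
    """Standard deviation of a two-asset portfolio with weights summing to one."""
    weight_2 = 1.0 - weight_1
    variance = (
        (weight_1 * std_1) ** 2
        + (weight_2 * std_2) ** 2
        + 2.0 * weight_1 * weight_2 * correlation * std_1 * std_2
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))


# Hypothetical inputs: equal weights, standard deviations of 20% and 10%.
weighted_average = 0.5 * 20.0 + 0.5 * 10.0
std_by_correlation = {
    rho: portfolio_std(0.5, 20.0, 10.0, rho) for rho in (1.0, 0.5, 0.0, -1.0)
}
```

Only at a correlation of exactly one does the portfolio standard deviation equal the weighted average; for every lower correlation it is smaller.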

The correlations also reveal whether the factors are actually distinct in their definitions. In a sense, it would be expected for the factors to act as independent principal components, although they have been identified beforehand based on empirical reasoning. However, as there may be shared qualities in their definitions, there may be similarities in their results in different periods and they may not be completely disconnected. It should also be acknowledged that the correlations between factors would be expected to increase as the number of factors increases, as there is a finite spectrum of unique information which can be extracted and overlaps have to occur as more factors are added to imperfectly fill in the remaining gaps based on their definitions. This is seen in the incremental but diminishing improvements between the Capital Asset Pricing Model, Fama-French 3-Factor Model, and Fama-French 5-Factor Model.

Correlations of yearly realized returns for the components of the Fama-French 5-Factor Model in developed markets:

Correlations of yearly realized returns for the components of the Fama-French 5-Factor Model in emerging markets:

Correlations of yearly realized returns for the components of the Fama-French 5-Factor Model in the United States:

Correlations of yearly realized returns for the components of the Fama-French 5-Factor Model in Europe:

Correlations of yearly realized returns for the components of the Fama-French 5-Factor Model in Japan:

Correlations of yearly realized returns for the components of the Fama-French 5-Factor Model in Asia Pacific excluding Japan:

Software Architecture Overview

The project was designed with an object-oriented approach, using a class for each part. Through the metadata class, the selection and reference of data is controlled by a CSV file that specifies, for each source, a title to display, the region to which the data applies, a label to use as the variable name, the URL from which to access the data, and the indices of the relevant sections and columns within the data. These indices need to be assigned manually based on which sections should be extracted and which columns should be extracted within each section; labels are also required and form part of the dataframes. Specifically, the columns are Title, Region, Label, URL, SetsIndex, SetsTitle, SetsLabel, ColumnsName, and ColumnsIndex (although purposefully designed to work with the online library provided by Kenneth French, the subsequent operations would work with any file of the same structure).

Example of a CSV detailing the information to generate an object using the metadata class:

				Title, Region, Label, URL, SetsIndex, SetsTitle, SetsLabel, ColumnsName, ColumnsIndex
				Five Factor Model, Developed Markets, five_factor_dv, https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Developed_5_Factors_CSV.zip, 0; 1, Monthly Data; Yearly Data, data_month; data_year, "Date, Mkt-RF, SMB, HML, RMW, CMA, RF; Date, Mkt-RF, SMB, HML, RMW, CMA, RF", "0, 1, 2, 3, 4, 5, 6; 0, 1, 2, 3, 4, 5, 6"
				Five Factor Model, Emerging Markets, five_factor_em, https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Emerging_5_Factors_CSV.zip, 0; 1, Monthly Data; Yearly Data, data_month; data_year, "Date, Mkt-RF, SMB, HML, RMW, CMA, RF; Date, Mkt-RF, SMB, HML, RMW, CMA, RF", "0, 1, 2, 3, 4, 5, 6; 0, 1, 2, 3, 4, 5, 6"
				Five Factor Model, United States, five_factor_us, https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip, 0; 1, Monthly Data; Yearly Data, data_month; data_year, "Date, Mkt-RF, SMB, HML, RMW, CMA, RF; Date, Mkt-RF, SMB, HML, RMW, CMA, RF", "0, 1, 2, 3, 4, 5, 6; 0, 1, 2, 3, 4, 5, 6"
				Five Factor Model, Europe, five_factor_eu, https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Europe_5_Factors_CSV.zip, 0; 1, Monthly Data; Yearly Data, data_month; data_year, "Date, Mkt-RF, SMB, HML, RMW, CMA, RF; Date, Mkt-RF, SMB, HML, RMW, CMA, RF", "0, 1, 2, 3, 4, 5, 6; 0, 1, 2, 3, 4, 5, 6"
				Five Factor Model, Japan, five_factor_jp, https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Japan_5_Factors_CSV.zip, 0; 1, Monthly Data; Yearly Data, data_month; data_year, "Date, Mkt-RF, SMB, HML, RMW, CMA, RF; Date, Mkt-RF, SMB, HML, RMW, CMA, RF", "0, 1, 2, 3, 4, 5, 6; 0, 1, 2, 3, 4, 5, 6"
				Five Factor Model, Asia Pacific ex Japan, five_factor_as, https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Asia_Pacific_ex_Japan_5_Factors_CSV.zip, 0; 1, Monthly Data; Yearly Data, data_month; data_year, "Date, Mkt-RF, SMB, HML, RMW, CMA, RF; Date, Mkt-RF, SMB, HML, RMW, CMA, RF", "0, 1, 2, 3, 4, 5, 6; 0, 1, 2, 3, 4, 5, 6"
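
A CSV in this layout can be read with the standard `csv` module. The following is a minimal sketch, assuming that semicolons separate the sets and commas separate the items within a set, as the sample rows suggest; the function names `split_nested` and `load_sources` and the dictionary keys are hypothetical and not taken from the project.

```python
import csv
import io

# Header and one row copied from the sample above.
METADATA_CSV = '''\
Title, Region, Label, URL, SetsIndex, SetsTitle, SetsLabel, ColumnsName, ColumnsIndex
Five Factor Model, Japan, five_factor_jp, https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Japan_5_Factors_CSV.zip, 0; 1, Monthly Data; Yearly Data, data_month; data_year, "Date, Mkt-RF, SMB, HML, RMW, CMA, RF; Date, Mkt-RF, SMB, HML, RMW, CMA, RF", "0, 1, 2, 3, 4, 5, 6; 0, 1, 2, 3, 4, 5, 6"
'''


def split_nested(cell, cast=str):
    """Split 'a, b; c, d' into [[a, b], [c, d]]."""
    return [[cast(item.strip()) for item in group.split(",")] for group in cell.split(";")]


def load_sources(text):
    """Turn each CSV row into a dictionary describing one source."""
    # skipinitialspace lets the quoted cells be recognized after ", ".
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    sources = []
    for row in reader:
        sources.append({
            "title": row["Title"],
            "region": row["Region"],
            "label": row["Label"],
            "url": row["URL"],
            "sets_index": [int(item) for item in row["SetsIndex"].split(";")],
            "sets_title": [item.strip() for item in row["SetsTitle"].split(";")],
            "sets_label": [item.strip() for item in row["SetsLabel"].split(";")],
            "columns_name": split_nested(row["ColumnsName"]),
            "columns_index": split_nested(row["ColumnsIndex"], int),
        })
    return sources


sources = load_sources(METADATA_CSV)
```

The `skipinitialspace=True` option matters here: without it, the space after each comma would prevent the quoted cells from being parsed as single fields.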

The metadata class treats each row in the CSV file as a source to be used. For each source, an object is created through the source class to handle the retrieval and extraction of the target data. An object created from the source class consists of identification properties from the associated metadata, as well as the raw and processed data as dataframes. The raw data is simply the retrieved data coerced to numeric values, with NaN values occupying incompatible records (a downloaded version of the original data is also optionally stored). As a result, each section is a continuous set of numeric values with groups of NaN values between them, which allows sections to be identified simply from the alignment of these values. The processed data then extracts the relevant sections and labels them based on the variable names from the metadata. Finally, the relevant columns are selected in each of the extracted sets (in all cases, it is necessary to select the column for dates, which is formatted as datetime values).
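
The idea of locating sections from runs of numeric rows separated by NaN rows can be sketched as follows. The project works with dataframes, while this simplified illustration uses plain lists from the standard library; the sample lines, the assumption that dates appear as YYYYMM or YYYY, and the function names are all hypothetical.

```python
import math
from datetime import datetime
from itertools import groupby


def to_number(cell):
    """Convert a cell to float, returning NaN for anything non-numeric."""
    try:
        return float(cell)
    except ValueError:
        return math.nan


def numeric_rows(lines):
    """Parse each line into floats, with NaN for incompatible cells."""
    return [[to_number(cell.strip()) for cell in line.split(",")] for line in lines]


def split_sections(rows):
    """Group consecutive rows whose first cell is numeric into sections."""
    sections = []
    for is_data, block in groupby(rows, key=lambda row: not math.isnan(row[0])):
        if is_data:
            sections.append(list(block))
    return sections


def to_date(value):
    """Interpret a numeric date as YYYYMM (monthly) or YYYY (yearly)."""
    text = str(int(value))
    return datetime.strptime(text, "%Y%m" if len(text) == 6 else "%Y")


# Hypothetical file content with a monthly and a yearly section.
raw_lines = [
    "Illustrative description line",
    "",
    ",Mkt-RF,SMB",
    "199007,0.50,-0.10",
    "199008,-1.20,0.30",
    "",
    " Annual Factors: January-December",
    ",Mkt-RF,SMB",
    "1991,10.00,2.50",
]
sections = split_sections(numeric_rows(raw_lines))
```

Here `sections[0]` corresponds to a SetsIndex of 0 (monthly data) and `sections[1]` to a SetsIndex of 1 (yearly data), after which the wanted columns can be picked by position.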

Illustration of the original, raw, and processed data and transformations in each step:

For analysis, several plots are available depending on the type of the source.

For the premiums from individual factors, it is possible to visualize the history of realized returns since inception with moving averages for various lengths of time; the distribution of realized returns as a histogram and kernel density estimate with several metrics identified (such as mean, median, maximum, minimum, standard deviation, skewness, and kurtosis); and the association between the factors with the Pearson, Kendall, and Spearman correlation coefficients and scatter plots showing a linear regression model.

For the portfolios constructed based on factors, the same history and distribution plots are available, together with cumulative realized returns for vintages beginning at each point in time and progressing through time on a linear or log scale, compared against fixed returns and with several metrics identified (such as average returns for various lengths of time, time until a cumulative return of 10 times, and time for drawdowns until a positive return); and annualized returns from the compound annual growth rate from each point in time and progressing through time, shown as a heatmap to illustrate the impacts of events and with several metrics identified (such as the dispersion of results for various lengths of time). Although the annualized results appear to converge, a small compounding difference can have drastically divergent outcomes over long periods of time, as seen in the variation of cumulative realized returns.
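
The quantities behind these plots are simple to compute. The following is a minimal sketch of a moving average, cumulative growth, and the compound annual growth rate for each vintage, assuming periodic returns expressed in percent; the function names and the sample returns are hypothetical and not taken from the project.

```python
def moving_average(values, window):
    """Simple moving average over a fixed window of observations."""
    return [
        sum(values[end - window:end]) / window
        for end in range(window, len(values) + 1)
    ]


def cumulative_growth(returns_pct):
    """Growth of one unit invested, compounding periodic percent returns."""
    growth, path = 1.0, []
    for period_return in returns_pct:
        growth *= 1.0 + period_return / 100.0
        path.append(growth)
    return path


def cagr(yearly_returns_pct):
    """Compound annual growth rate in percent for a run of yearly returns."""
    total_growth = cumulative_growth(yearly_returns_pct)[-1]
    return (total_growth ** (1.0 / len(yearly_returns_pct)) - 1.0) * 100.0


def cagr_by_vintage(yearly_returns_pct):
    """CAGR to the end of the sample for every possible starting year."""
    return [cagr(yearly_returns_pct[start:]) for start in range(len(yearly_returns_pct))]


# Hypothetical yearly returns in percent (illustrative only).
yearly = [10.0, -10.0, 21.0, 0.0]
growth_path = cumulative_growth(yearly)
vintages = cagr_by_vintage(yearly)
smoothed = moving_average(yearly, 2)
```

Arranging the CAGR for every combination of start and end year in a grid gives the kind of matrix that a heatmap of annualized returns would display.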

Finally, for organization, a collection class has been created to hold and manage the sources as properties. This allows for designated analysis based on the types of the sources (such as yearly data compared to monthly data or individual factor premiums compared to portfolio constructions), as well as saving the dataframes of the sources as separate sheets in an XLSX file - alternatively, it is possible to easily save the entire collection class as a PKL file through the Pickle module.
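
A container along these lines could be sketched as follows. This is a simplified illustration with hypothetical names: it stores each source as a plain dictionary and pickles that dictionary rather than the object itself, which is a design choice made here for portability and not necessarily how the project does it.

```python
import os
import pickle
import tempfile


class Collection:
    """Hypothetical container that holds sources by label."""

    def __init__(self):
        self.sources = {}

    def add(self, label, frequency, rows):
        """Register a source under a label with its frequency and data rows."""
        self.sources[label] = {"frequency": frequency, "rows": rows}

    def select(self, frequency):
        """Return only the sources with the requested frequency."""
        return {
            label: source
            for label, source in self.sources.items()
            if source["frequency"] == frequency
        }

    def save(self, path):
        """Write all sources to a PKL file."""
        with open(path, "wb") as handle:
            pickle.dump(self.sources, handle)

    @classmethod
    def load(cls, path):
        """Rebuild a collection from a PKL file written by save()."""
        collection = cls()
        with open(path, "rb") as handle:
            collection.sources = pickle.load(handle)
        return collection


# Round trip through a temporary file with hypothetical data.
collection = Collection()
collection.add("five_factor_jp_month", "monthly", [[199007.0, 0.5]])
collection.add("five_factor_jp_year", "yearly", [[1991.0, 10.0]])
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "collection.pkl")
    collection.save(target)
    restored = Collection.load(target)
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.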

Latest public version available online through the Git repository hosted on GitLab: